(54) Title: LANGUAGE TRAINING



(57) Abstract 

A speech synthesizer (3) produces prompts in the voice
of a native speaker of the language to be learned, which the
student may imitate or reply to, and a phrase recogniser (1)
which uses keyword recognition is employed so that the system
understands spoken phrases and interactive dialogue may take
place. The student's progress is monitored by measuring the
deviation from his original speech recognition template; when
this difference is sufficiently large that the recogniser (1)
can no longer recognise what the student is saying, the system
re-trains and updates the template. In another embodiment, the
system includes a display which shows the native speaker's
mouth shape whilst the words to be imitated are spoken by the
speech synthesizer (3); and a video pick-up and analyser for
analysing the shapes of the student's mouth to give the student
visual feedback.

LANGUAGE TRAINING



This invention relates to apparatus and methods for
training pronunciation; particularly, but not exclusively,
for training the pronunciation of second or foreign
languages.

One type of system used to automatically translate
speech between different foreign languages is described in
our European published patent application number
0262938A. This equipment employs speech recognition to
recognise words in the speaker's utterance, pattern
matching techniques to extract meaning from the utterance
and speech coding to produce speech in the foreign tongue.

This invention uses similar technology, but is
configured in a different way and for a new purpose, that
of training a user to speak a foreign language.

This invention uses speech recognition not only to
recognise the words being spoken but also to test the
consistency of the pronunciation. It is a disposition of
novice students of language that, although they are able
to imitate a pronunciation, they are liable to forget, and
will remain uncorrected until they are checked by an
expert. A machine which was able to detect
mispronunciation as well as translation inaccuracies would
enable students to reach a relatively high degree of
proficiency before requiring the assistance of a
conventional language teacher to progress further.
Indeed, very high levels of linguistic skill are probably
not required in the vast majority of communication tasks,
such as making short trips abroad or using the telephone,
and computer aided language training by itself may be
sufficient in these cases.



Conventional methods either involve expensive skilled
human teachers, or the use of passive recordings of
foreign speech which do not test the quality of the
student's pronunciation.

Some automated systems provide a visual display of a
representation of the student's speech, and the student is
expected to modify his pronunciation until this display
matches a standard. This technique suffers from the
disadvantage that users must spend a great deal of time
experimenting and understanding how their speech relates
to the visual representation.

Another approach (described for example in Revue de
Physique Appliquée vol 18 no. 9 Sept 1983 pp 595-610, K-T
Janot-Giorgetti et al, "Utilisation d'un système de
reconnaissance de la parole comme aide à l'acquisition
orale d'une langue étrangère") employs speaker independent
recognition to match spoken utterances against standard
templates. A score is reported to the student indicating
how well his pronunciation matches the ideal. However,
until speaker independent recognition technology is
perfected, certain features of the speaker's voice, such
as pitch, can affect the matching scores, and yet have no
relevant connection with the quality of pronunciation. A
student may therefore be encouraged to raise the pitch of
his voice to improve his score, and yet fail to correct an
important mispronunciation.

Furthermore, current speaker independent recognition
technology is unable to handle more than a small
vocabulary of words without producing a very high error
rate. This means that training systems based on this
technology are unable to process and interpret longer
phrases and sentences.

A method of training pronunciation for deaf speakers is
described in Proceedings ICASSP 87 vol 1 pp 372-375,
D. Kewley-Port et al, 'Speaker-dependent Recognition as the
Basis for a Speech Training Aid'. In
this method, a clinician selects the best pronounced
utterances of a speaker and these are converted into
templates. The accuracy of the speaker's subsequent
pronunciation is indicated as a function of his closeness
to the templates (the closer the better). This system has
two disadvantages: firstly, it relies upon human
intervention by the clinician, and secondly the speaker
cannot improve his pronunciation over his previous best
utterances but can only attempt to equal them.

According to the invention there is provided
apparatus for pronunciation training comprising:

- speech generation means for generating utterances; and

- speech recognition means arranged to recognise, in
a trainee's utterances, the words from a predetermined
selected set of words,

wherein the speech recognition means is arranged to
employ speaker-dependent recognition, by comparing the
trainee's utterance with templates for each word of the
set, and the apparatus is arranged initially to generate
the templates by prompting the trainee to utter each word
of the set and forming the templates from such utterances,
the apparatus being further arranged to indicate
improvements in pronunciation with increases in the
deviation of the trainee's subsequent utterances from the
templates.

Some non-limitative examples of embodiments of the
invention will now be described with reference to the
drawings, in which:

- Figure 1 illustrates stages in a method of
language training according to one aspect of the invention;

- Figure 2 illustrates schematically apparatus
suitable for performing one aspect of the invention;

- Figure 3 illustrates a display in an apparatus
for language training according to another aspect of the
invention.

Referring to Figures 1 and 2, upon first using the 
system illustrated, the student is asked by the system 
(using either a screen and keyboard or conventional speech 
synthesiser and speaker independent recogniser) which 
language he wishes to study, and which subject area (eg 
operating the telephone or booking hotels) he requires. 
The student then has to carry out a training procedure so 
that the speaker dependent speech recogniser 1 can 
recognise his voice. To this end, the student is prompted
in the foreign language by a speech generator 3 employing 
a pre-recorded native speaker's voice to recite a set of 
keywords relevant to his subject area. At the same time, 
the source language translation of each word is displayed, 
giving the student the opportunity to learn the 
vocabulary. This process, in effect, serves as a passive 
learning stage during which the student can practise his 
pronunciation, and can repeat words as often as he likes 
until he is satisfied that he has imitated the prompt as 
accurately as he believes he can.

A control unit 2 controls the sequence of prompts and
responses. Conveniently, the control unit may be a
personal computer (for example, the IBM PC).

These utterances are now used as, or to generate, the
first set of templates stored in template store 1a to be
used by the speech recogniser 1 to process the student's
voice. The templates represent the student's first
attempt to imitate the perfect pronunciation of the
recorded native speaker.
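
The template-forming step can be illustrated with a minimal
sketch. It assumes, purely for illustration, that each utterance
has already been reduced to a fixed-length feature vector; a
practical recogniser would compare time-varying feature
sequences. The function names, the example keywords and the
numeric values are hypothetical and are not taken from the
specification.

```python
from math import dist
from statistics import fmean

def make_template(utterances):
    """Average several fixed-length feature vectors into one template."""
    return [fmean(column) for column in zip(*utterances)]

def enrol(recordings):
    """Map each keyword to a template built from the student's own utterances."""
    return {word: make_template(vectors) for word, vectors in recordings.items()}

def recognise(templates, utterance):
    """Return (word, distance) pairs sorted from nearest to farthest template."""
    scored = [(word, dist(template, utterance))
              for word, template in templates.items()]
    return sorted(scored, key=lambda pair: pair[1])

# Hypothetical feature vectors for two keywords, two repetitions each.
recordings = {
    "bonjour": [[1.0, 2.0, 3.0], [1.2, 1.8, 3.0]],
    "merci": [[4.0, 0.0, 1.0], [4.2, 0.2, 0.8]],
}
templates = enrol(recordings)
best_word, best_distance = recognise(templates, [1.1, 2.0, 2.9])[0]
```

Here the templates come only from the student's own voice, so the
distance reported by recognise() measures how far a later
utterance has moved from the student's earlier pronunciation,
not how close it is to the native speaker.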

The second stage of the training process simply tests
the ability of the student to remember the translations
and pronunciations of the key word vocabulary. He is
prompted in his source language (either visually, on
screen 4, or verbally by speech generator 3) to pronounce
translations of the keywords he has practised in the
previous stage. After each word is uttered, the speech
generator 3 repeats the foreign word recognised by the
recogniser 1 back to the student and displays the source
language equivalent. Incorrect translations are noted for
re-prompting later in the training cycle. The student is
able to repeat words as often as he wishes, either to
refine his pronunciation or to correct a machine
misrecognition. If the recogniser 1 consistently (more
than, say, 5 times) misrecognises a foreign word, either
because of a low distance score or because two words are
recognised with approximately equal distances, the student
will be asked to recite this word again (preferably
several times), following a native speaker prompt from the
generator 3, so that a new speech recogniser template can
be produced to replace the original template in store 1a.
Such action in fact indicates that the student has changed
his pronunciation after having heard the prompt several
more times, and is converging on a more accurate imitation
of the native speaker. This method has the advantage over
the prior art that the trainee's progress is measured by
his deviation from his original (and/or updated) template,
rather than by his convergence on the native speaker's
template, thus eliminating problems due to pitch, or
other, differences between the two voices. Once the
student is satisfied that he has mastered the key word
vocabulary, he may move to the third training stage.
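
The re-training rule described above can be sketched as follows.
The sketch assumes that the recogniser returns candidate words
ranked by distance, that a poor score means a distance above a
rejection threshold, and that "consistently" means consecutive
failures; the class name, the two threshold values and the
example distances are hypothetical.

```python
from dataclasses import dataclass, field

MAX_FAILURES = 5        # "more than, say, 5 times" in the text
REJECT_DISTANCE = 2.0   # assumed: a best match worse than this is a poor score
AMBIGUITY_MARGIN = 0.2  # assumed: candidates closer than this are ambiguous

@dataclass
class ProgressMonitor:
    failures: dict = field(default_factory=dict)

    def is_misrecognition(self, target, ranked):
        """ranked is a list of (word, distance) pairs, nearest first."""
        best_word, best_distance = ranked[0]
        if best_word != target or best_distance > REJECT_DISTANCE:
            return True
        if len(ranked) > 1 and ranked[1][1] - best_distance < AMBIGUITY_MARGIN:
            return True
        return False

    def observe(self, target, ranked):
        """Return True when the template for target should be re-trained."""
        if not self.is_misrecognition(target, ranked):
            self.failures[target] = 0
            return False
        self.failures[target] = self.failures.get(target, 0) + 1
        if self.failures[target] > MAX_FAILURES:
            self.failures[target] = 0
            return True
        return False

# The student's pronunciation has drifted away from the stored template.
monitor = ProgressMonitor()
drifted = [("merci", 2.6), ("bonjour", 3.1)]
results = [monitor.observe("merci", drifted) for _ in range(6)]
```

In this run the first five failures are only counted and the
sixth triggers re-training, at which point the system would
prompt the student to recite the word again and replace the
template. A growing distance from the old template is read as
progress rather than as error.
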

The student is now prompted in his own language
(either visually on screen 4 or verbally through generator
3) and may be asked to carry out verbal translations of
words or complete phrases relevant to his subject area of
interest. Alternatively, these prompts may take the form
of a dialogue in the foreign language to which the student
must respond. One useful method of prompting is a
'storyboard' exercise using a screen display of a piece of
text, with several words missing, which the student is
prompted to complete by uttering what he believes are the
missing words. The system now preferably operates in the
same manner as the phrase-based language translation
system (European Published Application No 0262938) and
recognises the pre-trained keywords in order to identify
the phrase being uttered. The system then enunciates the
correct response/translation back to the student in a
native speaker's voice, and gives the student an
opportunity to repeat his translation if it was incorrect,
if he was not happy with the pronunciation, or if the
recogniser 1 was unable to identify the correct foreign
phrase. In the event that the student is unable to decide
whether the recogniser 1 has assimilated his intended
meaning, the source language version of the recognised
foreign phrase can be displayed at the same time.
Incorrectly translated phrases are re-presented (visually
or verbally) to the student later in the training cycle
for a further translation attempt.

If the recogniser 1 repeatedly fails to identify the
correct phrase because of poor key word recognition and
drifting student pronunciation, the student will be asked
to recite each key word present in the correct translation
for separate recognition. If one or more of these
keywords is consistently misrecognised, new templates are
generated as discussed above.

Phrases are presented to the student for translation
in an order which is related to their frequency of use in
the domain of interest. The system preferably enables the
trainee to suspend training at any point and resume at a
later time, so that he is able to progress as rapidly or
as slowly as he wishes.
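
One way to picture the ordering and the suspend/resume behaviour
is the sketch below. The usage counts, the example phrases and
the function names are hypothetical, and placing failed phrases
at the end of the queue is only one possible reading of "later in
the training cycle".

```python
from collections import deque

def build_queue(phrase_counts):
    """Order phrases from most to least frequently used in the domain."""
    ordered = sorted(phrase_counts, key=phrase_counts.get, reverse=True)
    return deque(ordered)

def run_session(queue, attempt, max_prompts):
    """Prompt up to max_prompts phrases; failed ones rejoin the end of the queue."""
    for _ in range(max_prompts):
        if not queue:
            break
        phrase = queue.popleft()
        if not attempt(phrase):
            queue.append(phrase)
    return queue  # the remaining queue is the state to save on suspension

# Hypothetical usage counts for a hotel-and-travel subject area.
counts = {
    "Where is the station?": 40,
    "I would like a room": 75,
    "How much is it?": 60,
}
queue = build_queue(counts)
first_order = list(queue)
remaining = run_session(
    queue, attempt=lambda phrase: phrase != "How much is it?", max_prompts=2
)
```

After two prompts the student has translated the most frequent
phrase correctly and failed the second, so the saved queue holds
the untried phrase followed by the one to be re-presented.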

The preferred type of phrase recognition (described in
European Published Application No 0262938 and 'Machine
Translation of Speech', Stentiford & Steer, British Telecom
Technology Journal Vol 6 No. 2 April '88 pp 116-123)
requires that phrases with variable parameters in them
such as dates, times, places or other sub-phrases, should
be treated in a hierarchical manner. The form of the
phrase is first identified using a general set of
keywords. Once this is done, the type of parameter
present in the phrase can be deduced and a special set of
keywords applied to identify the parameter contents.
Parameters could be nested within other parameters. As a
simple example, a parameter might refer to a major city, in
which case the special keywords would consist of just
these cities. During student training translation, errors
in parameter contents can also be treated hierarchically.
If the system has identified the correct form of phrase
spoken by the student, but has produced an incorrect
parameter translation, the student can then be coached to
produce the correct translation of the parameter in
isolation, without having to return to the complete phrase.
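
A two-level lookup of this kind might be sketched as below. The
phrase forms, keyword sets and city names are hypothetical, the
recogniser output is assumed to be a plain list of spotted words,
and nested parameters are not handled.

```python
# Hypothetical general keywords per phrase form, and the parameter each expects.
PHRASE_FORMS = {
    "book_flight": {"keywords": {"book", "flight"}, "parameter": "city"},
    "ask_time": {"keywords": {"what", "time"}, "parameter": None},
}
# Hypothetical special keyword sets, one per parameter type.
PARAMETER_KEYWORDS = {"city": {"paris", "london", "rome"}}

def identify_form(words):
    """Stage 1: choose the phrase form sharing most general keywords."""
    scores = {name: len(form["keywords"] & words)
              for name, form in PHRASE_FORMS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def identify_parameter(form_name, words):
    """Stage 2: apply only the special keywords for this form's parameter."""
    kind = PHRASE_FORMS[form_name]["parameter"]
    if kind is None:
        return None
    matches = PARAMETER_KEYWORDS[kind] & words
    return next(iter(matches)) if len(matches) == 1 else None

def recognise_phrase(recognised_words):
    """Return (form, parameter); parameter is None if it was not identified."""
    words = set(recognised_words)
    form = identify_form(words)
    if form is None:
        return None, None
    return form, identify_parameter(form, words)

good = recognise_phrase(["book", "flight", "rome"])
bad_parameter = recognise_phrase(["book", "flight", "madrid"])
```

In the second call the form is identified but the parameter is
not, which is the case in which the student would be coached on
the parameter alone rather than on the whole phrase.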

Parameters are normally selected in a domain of
discourse because of their occurrence across a wide range 
of phrases. It is natural therefore that the student 
should receive specific training on these items if he 
appears to have problems with them. 

The keywords are selected according to the information
they bear, and how well they distinguish the phrases used
in each subject area. This means that it is not necessary
for the system to recognise every word in order to
identify the phrase being spoken. This has the advantage
that a number of speech recognition errors can be
tolerated before phrase identification is lost.
Furthermore, correct phrases can be identified in spite of
errors in the wording which might be produced by a
novice. It is reasonable to conjecture that, if the
system is able to match attempted translations with their
corrected versions, such utterances should be intelligible
in practice when dealing with native speakers who are
aware of the context. This means that the system tends to
concentrate training on just those parts of the student's
diction which give rise to the greatest ambiguity in the
foreign language. This might be due to bad pronunciation
of important keywords or simply due to their omission.
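
The idea of choosing keywords by how well they distinguish
phrases can be illustrated with a simple greedy heuristic. This
is a stand-in chosen for brevity and is not the selection
procedure of the cited references; the phrase labels and word
sets are hypothetical.

```python
def count_distinct(phrases, keywords):
    """Number of different presence/absence patterns the keywords produce."""
    return len({tuple(k in words for k in keywords)
                for words in phrases.values()})

def select_keywords(phrases):
    """Greedily add the word that separates the most phrases until none helps."""
    vocabulary = sorted(set().union(*phrases.values()))
    chosen, distinct = [], 1
    while distinct < len(phrases):
        best = max(vocabulary,
                   key=lambda w: count_distinct(phrases, chosen + [w]))
        improved = count_distinct(phrases, chosen + [best])
        if improved == distinct:
            break  # no remaining word separates any further phrases
        chosen.append(best)
        distinct = improved
    return chosen

# Hypothetical phrases, each reduced to the set of words it contains.
phrases = {
    "room": {"like", "room", "night"},
    "ticket": {"like", "ticket", "train"},
    "station": {"where", "station", "train"},
}
keywords = select_keywords(phrases)
```

Two keywords are enough to tell these three phrases apart, so the
remaining words need not be recognised at all. A minimal set like
this has no redundancy; tolerating recognition errors, as the
text describes, would call for additional keywords per phrase.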

The described system therefore provides an automated
learning scheme which can rapidly bring language students
up to a minimum level of intelligibility, and is
especially useful for busy businessmen who simply wish to
expedite their transactions, or holiday-makers who are not
too worried about grammatical accuracy.

The correct pronunciation of phrases is given by the
recorded voice of a native speaker, who provides the
appropriate intonation and co-articulation between words.
The advanced student is encouraged to speak in the same
manner, and the system will continue to check each
utterance, providing the word spotting technology employed
is able to cope with the increasingly fluent speech.

Referring to Figure 3, in another aspect of the
invention, a visual display of the mouth of the native
speaker is provided so as to exhibit the articulation of
each spoken phrase. This display may conveniently be
provided on a CRT display using a set of quantised mouth
shapes as disclosed in our previous European Published
Application No. 0225729A. A whole facial display may also
be used.



WO 90/01203 PCT/GB89/00846

In one simple embodiment, the display may be mounted
in conjunction with a mirror so that the student may
imitate the native speaker.

In a second embodiment, a videophone coding apparatus
of the type disclosed in our previous European Published
Application No. 0225729 may be employed to generate a
corresponding display of the student's mouth so that he
can accurately compare his articulation with that of the
native speaker. The two displays may be simultaneously
replayed by the student, either side by side, or
superimposed (in which case different colours may be
employed), using a time-warp method to align the displays.
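
The specification does not say which time-warp method is used.
As an illustration only, the sketch below applies classic dynamic
time warping to two sequences of quantised mouth-shape codes,
assuming that each frame is a single integer code and that the
mismatch between two frames is the absolute difference of their
codes. The function name and the example sequences are
hypothetical.

```python
def time_warp(reference, student, cost=lambda a, b: abs(a - b)):
    """Return (reference_index, student_index) pairs on the cheapest path."""
    n, m = len(reference), len(student)
    inf = float("inf")
    total = [[inf] * (m + 1) for _ in range(n + 1)]
    total[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(reference[i - 1], student[j - 1])
            total[i][j] = step + min(
                total[i - 1][j - 1], total[i - 1][j], total[i][j - 1]
            )
    # Trace back from the end to find which frames are shown together.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [
            (total[i - 1][j - 1], i - 1, j - 1),
            (total[i - 1][j], i - 1, j),
            (total[i][j - 1], i, j - 1),
        ]
        _, i, j = min(moves)
    return path[::-1]

# Hypothetical mouth-shape codes; the student holds shape 1 for longer.
native = [0, 1, 2, 1, 0]
student = [0, 1, 1, 2, 1, 0]
path = time_warp(native, student)
```

Each pair in the path names a native-speaker frame and the
student frame to be displayed alongside or superimposed on it, so
that the longer-held mouth shape in the student's sequence is
matched against a single frame of the native speaker's.
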

CLAIMS

1. Apparatus for pronunciation training comprising:

- speech generation means for generating
utterances; and

- speech recognition means arranged to recognise, in
a trainee's utterances, the words from a predetermined
selected set of words,

wherein the speech recognition means is arranged to
employ speaker-dependent recognition, by comparing the
trainee's utterance with templates for each word of the
set, and the apparatus is arranged initially to generate
the templates by prompting the trainee to utter each word
of the set and forming the templates from such utterances,
the apparatus being further arranged to indicate
improvements in pronunciation with increases in the
deviation of the trainee's subsequent utterances from the
templates.

2. Apparatus according to claim 1, further arranged
to update the templates from the said subsequent
utterances when the said deviation exceeds a predetermined
threshold.

3. Apparatus according to claim 1 or claim 2 further
comprising control means connected to the speech
generation means and to the speech recognition means, and
arranged so that, in use, the apparatus
generates a prompt to which a trainee may respond by
speaking, the speech recognition means is arranged to
recognise in the trainee's response the presence of words
from the said set, and the speech generation means is
arranged to generate an utterance in dependence on what
the speech recognition means has recognised.

4. Apparatus for pronunciation training according to
claim 3, further comprising:

phrase recognition means for identifying phrases
by the combination and order of words from the said
predetermined selected set,

wherein in use the trainee is prompted to respond by
uttering a phrase, the phrase recognition means recognises
the phrase and the utterance generated by the speech
generation means is thereby selected to be a reply to the
phrase.

5. Apparatus for pronunciation training according to
claim 3 or claim 4, wherein the prompt is an utterance
generated by the speech generation means.

6. Pronunciation training apparatus comprising 
speech generation means for generating utterances, and 
video generation means for generating corresponding video 
images of a mouth, whereby a trainee is prompted to 
imitate the correct pronunciation of the said utterances. 

7. Apparatus according to claim 6, further 
comprising video analysis means arranged to analyse mouth 
movements of the trainee and to display the corresponding 
synthesised and analysed mouth movements. 

8. Language training apparatus according to any
preceding claim, wherein the speech generation means is
arranged to generate utterances in a language in the
accent of a native speaker of that language.

9. A method of pronunciation training, comprising:

prompting a trainee to speak an utterance, and

analysing the utterance using speaker-dependent
speech recognition, employing templates derived from the
trainee's previous utterances;

whereby improvements in pronunciation are assessed by
measuring the distance between the utterance and the
template; the assessment being such that an increase in
distance corresponds to a pronunciation improvement.

10. A method according to claim 9 further comprising
the step of updating the said templates when the said
distance exceeds a predetermined threshold.

11. A method of pronunciation training comprising
employing apparatus according to any one of claims 1 to 6.

12. Apparatus for pronunciation training
substantially as herein described with reference to Figure
1 and Figure 2, or Figure 3.

13. A method of pronunciation training substantially
as herein described with reference to Figure 1 or Figure 3.